Skip to content

Conversation

@chriselrod
Copy link
Contributor

Adding the @inlines

julia> @benchmark ForEach(expm!, $B, $As)()
BenchmarkTools.Trial: 8692 samples with 1 evaluation.
 Range (min  max):   88.607 μs    1.523 ms  ┊ GC (min  max):  0.00%  87.83%
 Time  (median):      95.456 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   110.611 μs ± 120.857 μs  ┊ GC (mean ± σ):  11.04% ±  9.27%

  █      ▁                                                      ▁
  █▇▄▁▄▄▅█▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▆ █
  88.6 μs       Histogram: log(frequency) by time       1.12 ms <

 Memory estimate: 545.41 KiB, allocs estimate: 119.

Prior to that, performance was around 215 microseconds.

@chriselrod
Copy link
Contributor Author

@KristofferC ?

@chriselrod
Copy link
Contributor Author

Running some internal proprietary benchmark, the latest ForwardDiff release gives me

 18.568555 seconds (24.68 M allocations: 1.288 GiB, 2.00% gc time, 2.42% compilation time)
  7.463995 seconds (24.35 M allocations: 1.268 GiB, 2.79% gc time, 0.14% compilation time)
 18.064215 seconds (24.40 M allocations: 1.270 GiB, 1.72% gc time)
  7.450395 seconds (24.35 M allocations: 1.268 GiB, 2.67% gc time)

versus this PR

 13.992152 seconds (24.72 M allocations: 1.295 GiB, 2.51% gc time, 3.18% compilation time)
  6.132790 seconds (24.45 M allocations: 1.277 GiB, 4.11% gc time, 0.18% compilation time)
 13.443605 seconds (24.44 M allocations: 1.277 GiB, 2.11% gc time)
  6.045382 seconds (24.45 M allocations: 1.277 GiB, 3.19% gc time)

The difference is partially because my computer has AVX512.

If I start Julia with -C'native,-prefer-256-bit' to allow the autovectorizer to use 512 bit vectors, the latest ForwardDiff release gives me

 15.560895 seconds (24.68 M allocations: 1.288 GiB, 2.31% gc time, 2.70% compilation time)
  6.524974 seconds (24.35 M allocations: 1.268 GiB, 3.69% gc time, 0.17% compilation time)
 15.263915 seconds (24.40 M allocations: 1.270 GiB, 2.00% gc time)
  6.641746 seconds (24.35 M allocations: 1.268 GiB, 3.04% gc time)

while this PR is unchanged (as it isn't relying on the autovectorizer).

Compile times appear unchanged.
This PR is an easy free performance win.

I am not sure if retval not being inlined caused the problem we observed before the initial reverting, but I would like to see this move forward.

Or I'm just going to give up and create my own ForwardDiff, since duplicated effort tends to be smaller than collaboration's cost.

@KristofferC
Copy link
Collaborator

I missed this. Wouldn't the "correct" way to do this be to rebase the original branch on master, force push that commit and then add this above. As it is right now, it is kind of unreviewable.

@KristofferC
Copy link
Collaborator

Or I'm just going to give up and create my own ForwardDiff,

Hehe, yeah, replicating ForwardDiff is pretty fast. I tried something with Hessians using SIMD.jl in https://github.com/KristofferC/HyperHessians.jl with some decent perf results.

Comment on lines +199 to +215
@inline function dual_definition_retval(::Val{T}, val::Real, deriv::Real, partial::Partials) where {T}
return Dual{T}(val, deriv * partial)
end
@inline function dual_definition_retval(::Val{T}, val::Real, deriv1::Real, partial1::Partials, deriv2::Real, partial2::Partials) where {T}
return Dual{T}(val, _mul_partials(partial1, partial2, deriv1, deriv2))
end
@inline function dual_definition_retval(::Val{T}, val::Complex, deriv::Union{Real,Complex}, partial::Partials) where {T}
reval, imval = reim(val)
if deriv isa Real
p = deriv * partial
return Complex(Dual{T}(reval, p), Dual{T}(imval, zero(p)))
else
rederiv, imderiv = reim(deriv)
return Complex(Dual{T}(reval, rederiv * partial), Dual{T}(imval, imderiv * partial))
end
end
@inline function dual_definition_retval(::Val{T}, val::Complex, deriv1::Union{Real,Complex}, partial1::Partials, deriv2::Union{Real,Complex}, partial2::Partials) where {T}
Copy link
Contributor Author

@chriselrod chriselrod Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KristofferC adding these 4 @inlines to dual_definition_retval should be the only actual change.
I can't rebase your original PR. Mind doing so?

Perhaps it'd be better to rebase on the 0.10 branch than on master (although you could do master if you'd prefer having that be a breaking release for safety reasons).

You're then free to add these 4 @inlines directly, or seeing if this diff looks cleaner afterwards.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the kc/simd branch, please check it out.

@chriselrod
Copy link
Contributor Author

chriselrod commented Aug 14, 2023

using SIMD.jl in https://github.com/KristofferC/HyperHessians.jl with some decent perf results.

Yeah, it might also be worth look at some of the existing alternative options before actually doing so, since some may be good starting points, especially as Pumas wants 2nd+ derivatives.

I think there was another package that was supposed to safe flops by being sparser (i.e., not calculating the same partials multiple times).

The worry is always the potentially long tail of polish/corner cases.
There have been a lot of long discussions over NaN handling, for example, that I haven't paid much attention to and would probably need to make sure I get right/satisfactory.

@chriselrod
Copy link
Contributor Author

Closing because this PR is no longer necessary (#570 incorporates the change).

@chriselrod chriselrod closed this Aug 14, 2023
@chriselrod chriselrod deleted the ce/simd branch August 14, 2023 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants